"Made micro-optimizations to core math libraries"
- eliminated several instructions from matrix multiply for Neon and improved scheduling for all platforms
- multiplexed scalar SinCos to do both calculations at the same time (213% faster on my PC)
- translated scalar SinCos to float32x2_t for Switch (27% faster)
- fused the sin and cos calls in AngleToVector
- made scaling matrix functions about half as many instructions for Neon
- changed Neon Matrix <-> Matrix16 to use structured-load/store instructions
- changed determinant function to use vdot4 where it can be done with horizontal adds
- cache-line aligned some of the matrix constants
- simplified Neon Matrix::ScaleAxes(Vector) to match other platforms
- shaved a few instructions and a memory acces...
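
To illustrate the SinCos and AngleToVector items, here is a minimal scalar sketch of computing sine and cosine together so that the range reduction and the squared argument are shared. It is not the library's actual code: the polynomial degrees, the constants, the Vec2 type, and the x = cos, y = sin convention are all assumptions made for illustration, and accuracy is only intended for moderate angles.

```cpp
#include <cmath>

// Hypothetical sketch, not the library's implementation.
// Computes sin(x) and cos(x) in one pass, sharing the range reduction.
inline void SinCos(float x, float& s, float& c) {
    const float kTwoOverPi = 0.63661977f;
    const float kHalfPi    = 1.57079633f;

    // Reduce x to r in roughly [-pi/4, pi/4]; q is the quadrant index.
    const long  q  = std::lround(x * kTwoOverPi);
    const float r  = x - static_cast<float>(q) * kHalfPi;
    const float r2 = r * r;  // shared by both polynomials

    // Truncated Taylor polynomials (assumed degrees; a real library
    // would likely use minimax coefficients instead).
    const float sr = r * (1.0f + r2 * (-1.0f / 6.0f + r2 * (1.0f / 120.0f
                     + r2 * (-1.0f / 5040.0f))));
    const float cr = 1.0f + r2 * (-0.5f + r2 * (1.0f / 24.0f
                     + r2 * (-1.0f / 720.0f + r2 * (1.0f / 40320.0f))));

    // Rotate the reduced result back into the original quadrant.
    switch (q & 3) {
        case 0:  s =  sr; c =  cr; break;
        case 1:  s =  cr; c = -sr; break;
        case 2:  s = -sr; c = -cr; break;
        default: s = -cr; c =  sr; break;
    }
}

// Hypothetical 2D vector type, for illustration only.
struct Vec2 { float x, y; };

// One fused call instead of separate sin and cos calls.
// Assumes the convention x = cos(angle), y = sin(angle).
inline Vec2 AngleToVector(float angle) {
    Vec2 v;
    SinCos(angle, v.y, v.x);
    return v;
}
```

The gain in a sketch like this comes from doing the quadrant reduction and the r * r multiply once, and from giving the compiler two independent polynomial chains that it can interleave.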